CHL5230 - Applied Machine Learning for Health Data

Instructor: Zahra Shakeri - Fall 2023
Dalla Lana School of Public Health - University of Toronto
Datathon #1


Datathon Description and Instructions

Datathon Context and Objective

In the modern age of data-driven decision-making, public health is set to benefit from the insights that can be derived from data analytics. As many nations grapple with health crises, understanding the complex determinants of health outcomes and creating timely interventions is paramount. The datasets provided for this datathon give a clear view into specific behavioral and physical characteristics of individuals. These primarily focus on the challenge of obesity, which is a societal concern affecting community health, public expenses, and the overall well-being of society.

In Canada, data from 2019 indicates that nearly two-thirds of adults were either overweight or obese. Additionally, about one-third of children aged between 5 and 17 were categorized as overweight or obese [1]. These numbers are concerning, especially when considering the association of obesity with chronic diseases such as Type 2 diabetes, heart disease, and certain cancers [2].

This datathon has two main objectives. The first is to thoroughly analyze these datasets, extracting patterns that can guide decision-makers in developing efficient surveillance and intervention strategies. The second objective involves a dataset from a research paper on lung cancer risk factors in Ethiopia. This offers an opportunity to correlate and integrate findings from two different health challenges: obesity and lung cancer. The main task is to identify underlying patterns and risk factors, and then create models that are useful for public health planning.

Dataset Information

Dataset #1: Lung Cancer Risk Factors from Tikur Ambesa Hospital

This dataset is from Tikur Ambesa Hospital in Addis Ababa, Ethiopia. It provides a detailed overview of patient medical histories, findings from physical examinations, laboratory test results, diagnoses from consultants, treatment details, and notes from various healthcare professionals.

The dataset focuses on lung cancer risk factors, containing information from 1,000 lung cancer patients of different severity levels and 465 individuals who were screened for lung cancer but were found healthy. Although 15 risk factors for lung cancer are generally recognized, medical professionals at Tikur Ambesa Hospital have identified 11 as the most important. The dataset categorizes lung cancer severity based on the stage of the disease, allowing for a detailed understanding of its progression.

The dataset includes records from 872 women and 593 men, ranging in age from 14 to 73 years. This wide age range provides a valuable perspective on how age interacts with other lung cancer risk factors.
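
As a quick sanity check, the counts quoted above can be cross-checked against each other before any modelling. The sketch below uses only the figures stated in this description; the variable names are illustrative and are not column names from the dataset.

```python
# Counts exactly as stated in the dataset description (not read from the file).
patients = 1000   # lung cancer patients, all severity levels
healthy = 465     # screened for lung cancer and found healthy
women = 872
men = 593

total_by_status = patients + healthy
total_by_sex = women + men

# Both breakdowns should describe the same set of records.
assert total_by_status == total_by_sex

patient_share = patients / total_by_status
women_share = women / total_by_sex

print(f"Total records: {total_by_status}")
print(f"Share with lung cancer: {patient_share:.1%}")
print(f"Share of women: {women_share:.1%}")
```

Both breakdowns add up to 1,465 records, and roughly two-thirds of them are lung cancer patients, so plain accuracy on a patient-versus-healthy task should be read against that baseline.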

Dataset #2: Public Health Factors Influencing BMI

This dataset contains information on several public health factors that might influence BMI in Canada. It includes ten variables: age, gender, daily caloric intake, physical activity, smoking status, alcohol consumption, hours of sleep, screen time, cholesterol levels, and blood pressure. Each of these factors is known to affect health and, in turn, BMI.

The dataset consists of 23,535 entries, offering a comprehensive view of the Canadian population. It intentionally has a distribution split of 37%-63%, reflecting the kind of demographic imbalances that are often seen in real-world data.
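
One common way to handle an imbalance like this is to keep the same proportions in the training and test sets. The following is a minimal sketch of a stratified split in plain Python. The description does not say which variable carries the 37%-63% split, so the labels "A" and "B" and the 1,000 synthetic rows are placeholders, not real data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), preserving each label's share in both sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                       # shuffle within each group
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Placeholder labels reproducing a 37%-63% split on synthetic rows.
labels = ["A"] * 370 + ["B"] * 630
train_idx, test_idx = stratified_split(labels)

test_share_a = sum(labels[i] == "A" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_share_a, 2))
```

If your team uses scikit-learn, `train_test_split` with its `stratify` argument serves the same purpose.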

Though the main focus of this dataset is on factors that influence BMI, its breadth makes it suitable for various analyses, including machine learning techniques such as K-NN for classification and clustering.
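
To make the K-NN idea concrete, here is a small illustrative sketch in plain Python. It is not a prescribed solution. The two features (daily calories in thousands, hours of physical activity) and all values are made up for the example. Turning BMI into categories with the standard WHO adult cut-offs (18.5, 25, 30) is one possible framing of the classification target, not something the dataset description specifies.

```python
import math
from collections import Counter


def bmi_category(bmi):
    """Map a BMI value to a WHO adult category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Synthetic rows: (daily calories in thousands, hours of physical activity).
train_X = [(1.8, 1.5), (2.0, 1.2), (2.1, 1.0), (3.0, 0.2), (3.2, 0.1), (2.9, 0.3)]
train_bmi = [21.0, 22.5, 24.0, 31.0, 33.5, 30.2]
train_y = [bmi_category(b) for b in train_bmi]

print(knn_predict(train_X, train_y, (3.1, 0.2)))
print(knn_predict(train_X, train_y, (1.9, 1.3)))
```

Because K-NN relies on distances, features measured on very different scales (for example, calories versus hours of sleep) usually need to be standardized first. In practice, a library implementation such as scikit-learn's `KNeighborsClassifier` is likely a better fit for a dataset of this size.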

The datasets can be found at Modules/Datathon #1, and they will be provided at 6:45 pm on Tuesday, September 19, 2023.

Instructions for Submission

You are encouraged to discuss your work with your teammates and other teams and can use online and offline resources. However, all members of your team should make substantial, meaningful contributions to your submission, ensuring fairness to all participating teams in this datathon. Teams must submit the low-fidelity prototype by the in-class deadline of 8:00 PM on September 19, 2023, and all remaining materials by the final deadline of 2:00 PM on September 26, 2023. It is advisable to work consistently on the deliverables from the outset rather than attempting to complete them all within the last hour; you should begin work at least three days before the final deadline.

Components of Submission

1. Low-fidelity Prototype (In-class Submission)

The first phase of this Datathon involves collaborative efforts among students, aimed at transforming the provided datasets into actionable insights. Teams should formulate research questions and outline their data analysis plans, followed by submitting a low-fidelity prototype of their solution to Assignments/Datathon#1/Low-fidelity Prototype. Please adhere to the naming convention outlined later in this document when naming your one-page PDF submission for today.

Every team is required to submit their low-fidelity prototype through Quercus by 8:00 PM on September 19, 2023. A successful submission should include a clear and legible list of research questions that you plan to address using the provided datasets. Additionally, provide a detailed plan specifying the analysis methods (e.g. machine learning) you intend to employ for addressing these questions. Ensure that each research question corresponds to its respective analysis plan.

Please note that you are not obligated to finalize your solution or research questions at this stage. If you come up with a better idea during the week, feel free to update your plan. The primary goal of the low-fidelity milestone is to initiate the brainstorming phase of a data science project, which is typically the initial and most critical phase. It allows you to see how the project’s direction may evolve during your analysis.

2. A High-fidelity Prototype

All teams are expected to submit their analysis results and deliver brief presentations (2 minutes for the presentation, followed by 1 minute for questions) consisting of a minimum of 2 and a maximum of 3 slides. The purpose of these presentations is to guide your instructor and TA(s) on how you leveraged the available data to address the research question you formulated.

During your presentation, cover essential elements, including meaningful results, the data analysis process, challenges encountered, and key findings. While you have the flexibility to decide the presentation’s content, it should focus on conveying a clear understanding of the analytical process, findings, and conclusions. In essence, the presentation should provide a condensed version of the written report.

To allow the TA to prepare teams’ presentations effectively, it is imperative that teams finalize their submissions by 2:00 PM on September 26, 2023.

3. A Written Report

Teams are required to compile a report that details the steps taken to address their proposed question or prompt. While there is not a prescribed format for the report, it should encompass key sections such as:

  • Introduction: Explain the questions you aimed to answer with the data and their significance.
  • Data Engineering Process: Describe how you cleaned and prepared the data and specify the datasets used.
  • Analysis: Outline the learning and analysis techniques employed, along with the rationale behind their selection.
  • Findings: Present your discoveries and insights.
  • Conclusion: Summarize what health practitioners can infer from your team’s work.
  • Individual Contributions: Highlight the contributions of each team member throughout the entire process.
  • Code and Presentation: Host your Datathon materials, including notebooks and datasets, on GitHub, and share the GitHub project link in the report for easy access by the TA. Also host your presentation on Google Slides and provide the public link in the report.

Note: When submitting your report to Quercus, please consolidate all components into one PDF file and include links to other relevant elements within the report. Name your file following the format: Team Number-CHL5230-F23 (e.g., 25-CHL5230-F23.PDF). Submissions not adhering to this naming convention will not be considered for grading. Additionally, ensure that you include your team number and the names of all team members in your report.
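
Since files that do not follow the naming convention will not be graded, it may be worth checking the name before uploading. The helper below is a hypothetical convenience sketch, not an official tool. It builds the name from the team number and validates a candidate name. Accepting the extension in either case (.PDF or .pdf) is an assumption, since the example above only shows the upper-case form.

```python
import re

# Pattern for "<team number>-CHL5230-F23.PDF"; extension case is ignored (assumption).
FILENAME_PATTERN = re.compile(r"\d+-CHL5230-F23\.pdf", re.IGNORECASE)


def report_filename(team_number: int) -> str:
    """Build the file name in the same form as the example in this document."""
    return f"{team_number}-CHL5230-F23.PDF"


def is_valid_report_filename(name: str) -> bool:
    """Check whether a file name follows the required convention."""
    return FILENAME_PATTERN.fullmatch(name) is not None


print(report_filename(25))
print(is_valid_report_filename("25-CHL5230-F23.PDF"))
print(is_valid_report_filename("Team25_report.pdf"))
```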

At a minimum, the report should cover the question addressed, findings, the data analysis process, and a conclusion. The report must not exceed two pages in length. While the code should be functional and produce the reported results, it will not be evaluated based on code quality.

Ensure that all materials are submitted by 2:00 PM on September 26, 2023. No late submissions will be accepted.

This Datathon is pretty free-form! This is intentional; projects you work on in industry will rarely be very specific. Please feel free to show early results to me to get some feedback you can use to ensure a successful submission!

Why GitHub?

Firstly, if you are not already familiar with GitHub, it is a widely used platform for version control and collaborative development. In the context of our course, it plays a crucial role in facilitating effective teamwork and project management.

To help you get started with GitHub, I recommend watching this concise 10-minute tutorial on GitHub Desktop. This tutorial covers the basics, making it an excellent resource for beginners. Understanding GitHub is essential because data science projects, like the ones you will be working on, often involve multiple team members collaborating on code, data, and documents. GitHub provides the tools needed to track changes, coordinate efforts, and ensure that everyone is on the same page.

By utilizing GitHub for your team collaborations and submissions, you not only streamline your work but also contribute to a more organized and transparent project environment. It allows us to monitor individual contributions to each submission, which is valuable in assessing teamwork and participation.

Lastly, when you create your GitHub ID, please consider including your name in it. This makes it easier for your teammates, TA(s), and peers to identify you and fosters a sense of clarity and professionalism in your collaborative projects.

Important Dates

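| Component              | Due Time              | Where to Submit?                                |
|------------------------|-----------------------|-------------------------------------------------|
| Data Availability      | September 19, 6:45 pm | Modules/Datathon #1                             |
| Low-fidelity Prototype | September 19, 8:00 pm | Assignments/Datathon #1/Low-fidelity Prototype  |
| Written Report         | September 26, 2:00 pm | Assignments/Datathon #1/Written Report          |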

  1. “Overweight and obesity in adults, 2018.” Statistics Canada.

  2. “Obesity and Overweight.” World Health Organization.